{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "za-eT0AZAVEi"
      },
      "source": [
        "# RAG with llama.cpp and external API services\n",
        "\n",
        "[txtai](https://github.com/neuml/txtai) is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.\n",
        "\n",
        "txtai has been and always will be a local-first framework. It was originally designed to run models on local hardware using Hugging Face Transformers. As the AI space has evolved over the last year, so has txtai. Additional LLM inference frameworks have been available for a while using llama.cpp and external API services (via LiteLLM). Recent changes have added the ability to use these frameworks for vectorization and made it easier to use for LLM inference.\n",
        "\n",
        "This notebook will demonstrate how to run retrieval-augmented-generation (RAG) processes (vectorization and LLM inference) with llama.cpp and external API services."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2e6iFXweAVEk"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-UwwKDHCAVEk"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "\n",
        "# Install txtai and dependencies\n",
        "!pip install llama-cpp-python[server] --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline-llm]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YGu8d_niAVEk"
      },
      "source": [
        "# Embeddings with llama.cpp vectorization\n",
        "\n",
        "The first example will build an Embeddings database backed by [llama.cpp](https://github.com/ggerganov/llama.cpp) vectorization.\n",
        "\n",
        "The llama.cpp project states: _The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud_.\n",
        "\n",
        "Let's give it a try."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gZgObbS0AVEl"
      },
      "outputs": [],
      "source": [
        "from txtai import Embeddings\n",
        "\n",
        "# Create Embeddings with llama.cpp GGUF model\n",
        "embeddings = Embeddings(\n",
        "    path=\"second-state/All-MiniLM-L6-v2-Embedding-GGUF/all-MiniLM-L6-v2-Q4_K_M.gguf\",\n",
        "    content=True\n",
        ")\n",
        "\n",
        "# Load dataset\n",
        "wikipedia = Embeddings()\n",
        "wikipedia.load(provider=\"huggingface-hub\", container=\"neuml/txtai-wikipedia\")\n",
        "\n",
        "query = \"\"\"\n",
        "SELECT id, text FROM txtai\n",
        "order by percentile desc\n",
        "LIMIT 10000\n",
        "\"\"\"\n",
        "\n",
        "# Index dataset\n",
        "embeddings.index(wikipedia.search(query))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qD2jypWnAVEl"
      },
      "source": [
        "Now that the Embeddings database is ready, let's run a search query."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QhGrDP7AAVEl",
        "outputId": "42223d8d-a33e-47f6-9de3-e24cec4b31b0"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': 'Thomas Edison',\n",
              "  'text': 'Thomas Alva Edison (February 11, 1847October 18, 1931) was an American inventor and businessman. He developed many devices in fields such as electric power generation, mass communication, sound recording, and motion pictures. These inventions, which include the phonograph, the motion picture camera, and early versions of the electric light bulb, have had a widespread impact on the modern industrialized world. He was one of the first inventors to apply the principles of organized science and teamwork to the process of invention, working with many researchers and employees. He established the first industrial research laboratory.',\n",
              "  'score': 0.6758285164833069},\n",
              " {'id': 'Nikola Tesla',\n",
              "  'text': 'Nikola Tesla (; , ;  1856\\xa0– 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is best-known for his contributions to the design of the modern alternating current (AC) electricity supply system.',\n",
              "  'score': 0.6077840328216553},\n",
              " {'id': 'Alexander Graham Bell',\n",
              "  'text': 'Alexander Graham Bell (, born Alexander Bell; March 3, 1847 – August 2, 1922) was a  Scottish-born Canadian-American inventor, scientist and engineer who is credited with patenting the first practical telephone. He also co-founded the American Telephone and Telegraph Company (AT&T) in 1885.',\n",
              "  'score': 0.4573010802268982}]"
            ]
          },
          "metadata": {},
          "execution_count": 3
        }
      ],
      "source": [
        "embeddings.search(\"Inventors of electric-powered devices\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QyaqgY8QAVEm"
      },
      "source": [
        "As we can see, this Embeddings database works just like any other Embeddings database. The difference is that it's using a llama.cpp model for vectorization instead of PyTorch."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cewP6JCEAVEm"
      },
      "source": [
        "# RAG with llama.cpp\n",
        "\n",
        "LLM inference with llama.cpp is not a new txtai feature. A recent change added support for conversational messages in additional to standard prompts. This abstracts away having to understand prompting formats.\n",
        "\n",
        "Let's run a retrieval-augmented-generation (RAG) process fully backed by llama.cpp models.\n",
        "\n",
        "_It's important to note that conversational messages work with all LLM backends supported by txtai (transformers, llama.cpp, litellm)._"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "XTNX1mtMAVEm",
        "outputId": "62d92e44-02eb-4695-a507-0fc9729f0b57"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Based on the given context, here's a list of invented electric-powered devices:\n",
            "\n",
            "1. Electric light bulb by Thomas Edison\n",
            "2. Phonograph by Thomas Edison\n",
            "3. Motion picture camera by Thomas Edison\n",
            "4. Alternating current (AC) electricity supply system by Nikola Tesla\n",
            "5. Telephone by Alexander Graham Bell\n"
          ]
        }
      ],
      "source": [
        "from txtai import LLM\n",
        "\n",
        "# LLM instance\n",
        "llm = LLM(path=\"TheBloke/Mistral-7B-OpenOrca-GGUF/mistral-7b-openorca.Q4_K_M.gguf\")\n",
        "\n",
        "# Question and context\n",
        "question = \"Write a list of invented electric-powered devices\"\n",
        "context = \"\\n\".join(x[\"text\"] for x in embeddings.search(question))\n",
        "\n",
        "# Pass messages to LLM\n",
        "response = llm([\n",
        "    {\"role\": \"system\", \"content\": \"You are a friendly assistant. You answer questions from users.\"},\n",
        "    {\"role\": \"user\", \"content\": f\"\"\"\n",
        "Answer the following question using only the context below. Only include information specifically discussed.\n",
        "\n",
        "question: {question}\n",
        "context: {context}\n",
        "\"\"\"}\n",
        "])\n",
        "print(response)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RQSApBO_AVEm"
      },
      "source": [
        "And just like that, RAG with llama.cpp🦙!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Qrf44ulzAVEn"
      },
      "source": [
        "# Embeddings with external vectorization\n",
        "\n",
        "Next, we'll show how an Embeddings database can integrate with external API services via [LiteLLM](https://github.com/BerriAI/litellm) .\n",
        "\n",
        "In the LiteLLM project's own words: _LiteLLM handles loadbalancing, fallbacks and spend tracking across 100+ LLMs. All in the OpenAI format._\n",
        "\n",
        "Let's first startup a local API service to use for this demo."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iRoh9rPEAVEn"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "\n",
        "# Download models\n",
        "!wget https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-Q4_K_M.gguf\n",
        "!wget https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q4_K_M.gguf\n",
        "\n",
        "# Start local API services\n",
        "!nohup python -m llama_cpp.server --n_gpu_layers -1 --model all-MiniLM-L6-v2-Q4_K_M.gguf --host 127.0.0.1 --port 8000 &> vector.log &\n",
        "!nohup python -m llama_cpp.server --n_gpu_layers -1 --model mistral-7b-openorca.Q4_K_M.gguf --chat_format chatml --host 127.0.0.1 --port 8001 &> llm.log &\n",
        "!sleep 30"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fPnpAMnvAVEo"
      },
      "source": [
        "Now let's connect and use this local service to generate vectors for a new Embeddings database. Note that the local service responds in OpenAI's response format, hence the `path` setting below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "EpSLnVfwAVEo"
      },
      "outputs": [],
      "source": [
        "from txtai import Embeddings\n",
        "\n",
        "# Create Embeddings instance with external vectorization\n",
        "embeddings = Embeddings(\n",
        "    path=\"openai/gpt-4-turbo\",\n",
        "    content=True,\n",
        "    vectors={\n",
        "        \"api_base\": \"http://localhost:8000/v1\",\n",
        "        \"api_key\": \"sk-1234\"\n",
        "    }\n",
        ")\n",
        "\n",
        "# Load dataset\n",
        "wikipedia = Embeddings()\n",
        "wikipedia.load(provider=\"huggingface-hub\", container=\"neuml/txtai-wikipedia\")\n",
        "\n",
        "query = \"\"\"\n",
        "SELECT id, text FROM txtai\n",
        "order by percentile desc\n",
        "LIMIT 10000\n",
        "\"\"\"\n",
        "\n",
        "# Index dataset\n",
        "embeddings.index(wikipedia.search(query))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "teoKbfzZAVEo",
        "outputId": "55594186-c989-419e-974a-37a150ba4734",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': 'Thomas Edison',\n",
              "  'text': 'Thomas Alva Edison (February 11, 1847October 18, 1931) was an American inventor and businessman. He developed many devices in fields such as electric power generation, mass communication, sound recording, and motion pictures. These inventions, which include the phonograph, the motion picture camera, and early versions of the electric light bulb, have had a widespread impact on the modern industrialized world. He was one of the first inventors to apply the principles of organized science and teamwork to the process of invention, working with many researchers and employees. He established the first industrial research laboratory.',\n",
              "  'score': 0.6758285164833069},\n",
              " {'id': 'Nikola Tesla',\n",
              "  'text': 'Nikola Tesla (; , ;  1856\\xa0– 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is best-known for his contributions to the design of the modern alternating current (AC) electricity supply system.',\n",
              "  'score': 0.6077840328216553},\n",
              " {'id': 'Alexander Graham Bell',\n",
              "  'text': 'Alexander Graham Bell (, born Alexander Bell; March 3, 1847 – August 2, 1922) was a  Scottish-born Canadian-American inventor, scientist and engineer who is credited with patenting the first practical telephone. He also co-founded the American Telephone and Telegraph Company (AT&T) in 1885.',\n",
              "  'score': 0.4573010802268982}]"
            ]
          },
          "metadata": {},
          "execution_count": 7
        }
      ],
      "source": [
        "embeddings.search(\"Inventors of electric-powered devices\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0X1F1KcNAVEo"
      },
      "source": [
        "Like the previous example with llama.cpp, this Embeddings database behaves exactly the same. The main difference is that content is sent to an external service for vectorization."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Dr4HjhOAVEo"
      },
      "source": [
        "# RAG with External API services\n",
        "\n",
        "For our last task, we'll run a retrieval-augmented-generation (RAG) process fully backed by an external API service."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "42sPo6HCAVEo",
        "outputId": "7e42e28a-0bdd-4f68-a1ee-403527a241fd",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Based on the given context, a list of invented electric-powered devices includes:\n",
            "\n",
            "1. Phonograph by Thomas Edison\n",
            "2. Motion Picture Camera by Thomas Edison\n",
            "3. Early versions of the Electric Light Bulb by Thomas Edison\n",
            "4. AC (Alternating Current) Electricity Supply System by Nikola Tesla\n",
            "5. Telephone by Alexander Graham Bell\n"
          ]
        }
      ],
      "source": [
        "from txtai import LLM\n",
        "\n",
        "# LLM instance\n",
        "llm = LLM(path=\"openai/gpt-4-turbo\", api_base=\"http://localhost:8001/v1\", api_key=\"sk-1234\")\n",
        "\n",
        "# Question and context\n",
        "question = \"Write a list of invented electric-powered devices\"\n",
        "context = \"\\n\".join(x[\"text\"] for x in embeddings.search(question))\n",
        "\n",
        "# Pass messages to LLM\n",
        "response = llm([\n",
        "    {\"role\": \"system\", \"content\": \"You are a friendly assistant. You answer questions from users.\"},\n",
        "    {\"role\": \"user\", \"content\": f\"\"\"\n",
        "Answer the following question using only the context below. Only include information specifically discussed.\n",
        "\n",
        "question: {question}\n",
        "context: {context}\n",
        "\"\"\"}\n",
        "])\n",
        "print(response)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kFcqP5A7AVEp"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "txtai supports a number of different vector and LLM backends. The default method uses PyTorch models via the Hugging Face Transformers library. This notebook demonstrated how llama.cpp and external API services can also be used.\n",
        "\n",
        "These additional vector and LLM backends enable maximum flexibility and scalability. For example, vectorization can be fully offloaded to an external API service or another local service. llama.cpp has great support for macOS devices, alternate accelerators such AMD ROCm / Intel GPUs and has been known to run on Raspberry Pi devices.\n",
        "\n",
        "It's exciting to see the confluence of all these new advances coming together. Stay tuned for more!"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.19"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}